Spotify Dataset

The purpose of this project is to build a song recommendation system based on the user's music preferences.

This dataset contains different features of 600k+ tracks.

  1. tracks.csv: The audio features of tracks.
  2. artists.csv: The popularity metrics of artists.

All data in this dataset was collected through the Spotify API by Yamac Eren.

The dataset was obtained from Kaggle.

Description of the dataset

For more in-depth information about audio features provided by Spotify: https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features

Data Cleansing and Exploratory Data Analysis

Cleaning tracks data

Let's import our libraries.

Every row of this dataframe is a song, with its own features.

The dataset contains 586,672 rows and 20 columns.

We have a dataset with almost no null values. This is reasonable, as every song has a unique ID and its own properties.
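A quick way to back up a claim like this is to count the missing values per column. The snippet below is only a minimal sketch on a tiny made-up dataframe standing in for tracks.csv (the rows and the `name` gap are invented for illustration); on the real data the same two lines would apply to the loaded dataframe.

```python
import pandas as pd

# Tiny stand-in for tracks.csv; the values are made up for illustration.
tracks = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "name": ["Song A", None, "Song C"],
    "popularity": [10, 25, 40],
})

# Count missing values per column, then show only the columns that have any.
null_counts = tracks.isna().sum()
print(null_counts[null_counts > 0])
```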

At this point, notice that both artists and id_artists are strings and are not parsed as List objects. Let's convert them.
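One possible way to do this conversion, assuming the strings look like Python list literals (for example "['Artist A']"), is `ast.literal_eval`. This is a sketch on made-up rows, not necessarily the exact code used in this notebook.

```python
import ast

import pandas as pd

# Made-up rows that mimic how the list columns arrive as plain strings.
tracks = pd.DataFrame({
    "artists": ["['Artist A']", "['Artist B']"],
    "id_artists": ["['id_001']", "['id_002']"],
})

# literal_eval safely parses a Python literal without executing arbitrary code.
for col in ["artists", "id_artists"]:
    tracks[col] = tracks[col].apply(ast.literal_eval)

print(type(tracks["artists"].iloc[0]))
```

We prefer `ast.literal_eval` over `eval` here because it only accepts literals, so a malformed or malicious string raises an error instead of running code.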

Now both artists and id_artists are List objects, and we can use their properties and methods.

As id_artists is a list containing only one element, let's get the only element of the list and convert it into a str.
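A minimal sketch of this step, assuming each list really holds exactly one ID as described above (the IDs below are made up):

```python
import pandas as pd

# Made-up rows where id_artists has already been parsed into lists.
tracks = pd.DataFrame({"id_artists": [["id_001"], ["id_002"]]})

# Take the first (and, by assumption, only) element of each list as a string.
tracks["id_artists"] = tracks["id_artists"].apply(lambda ids: str(ids[0]))

print(tracks["id_artists"].tolist())
```

If some rows turned out to hold several IDs, this would silently keep only the first one, so it is worth checking the list lengths beforehand.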

Cleaning artist data

Notice that the genres column is parsed as a str, when it should be a List. Let's fix it.

Let's rename the columns to avoid problems when performing a join.

Exploratory Data Analysis

Let's join our dataframes by the artist ID so we have all the columns we need.
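The rename and join could look roughly like the sketch below. The new column names (`artist_id`, `artist_name`, `artist_popularity`), the left join, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

# Made-up stand-ins for the cleaned tracks and artists dataframes.
tracks = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "name": ["Song A", "Song B", "Song C"],
    "popularity": [10, 25, 40],
    "id_artists": ["id_001", "id_002", "id_001"],
})
artists = pd.DataFrame({
    "id": ["id_001", "id_002"],
    "name": ["Artist A", "Artist B"],
    "popularity": [55, 30],
})

# Rename the overlapping columns so they do not clash with the track columns.
artists = artists.rename(columns={
    "id": "artist_id",
    "name": "artist_name",
    "popularity": "artist_popularity",
})

# Left join: keep every track and attach its artist's columns.
songs = tracks.merge(artists, left_on="id_artists", right_on="artist_id", how="left")

# Count fully duplicated rows after the join.
print(songs.duplicated().sum())
```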

We have no duplicated data.

Let's dive deeper into the distribution of the dataset.

Helper Functions

Popularity Analysis

We can see there are more than 40,000 artists in the dataset that have the lowest popularity score. The mean popularity is 27.5 and the median is 27.

Minute duration Analysis

Let's take a look at the songs with a duration longer than 5 minutes.

We have 82473 songs with a duration longer than 5 minutes!
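A count like this can be obtained with a simple filter. The sketch below assumes the duration is stored in milliseconds in a `duration_ms` column, as Spotify reports it; the `duration_min` column name and the rows are made up.

```python
import pandas as pd

# Made-up durations in milliseconds.
tracks = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D"],
    "duration_ms": [180_000, 301_000, 300_000, 615_000],
})

# Convert milliseconds to minutes (60,000 ms per minute).
tracks["duration_min"] = tracks["duration_ms"] / 60_000

# Keep only the songs strictly longer than 5 minutes.
long_songs = tracks[tracks["duration_min"] > 5]
print(len(long_songs))
```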

Explicit Songs Analysis

Release Date Analysis

Danceability feature Analysis

Looks like most of our sample is quite danceable!

Note that Q1 is 0.45 and Q3 is 0.68, so most of the songs in this dataset are danceable. The IQR is 0.233.
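For reference, quartiles and the IQR can be computed as in the sketch below. The five values are made up, so the numbers it produces are not the ones quoted above.

```python
import pandas as pd

# Made-up danceability values, only to show the calculation.
danceability = pd.Series([0.2, 0.4, 0.5, 0.6, 0.8])

# Q1, Q2 (the median) and Q3 are the 25th, 50th and 75th percentiles.
q1, q2, q3 = danceability.quantile([0.25, 0.5, 0.75])

# The interquartile range is the width of the middle 50% of the data.
iqr = q3 - q1
print(q1, q2, q3, iqr)
```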

Energy feature Analysis

This distribution is interesting! Note that the energy feature looks like a downward parabola!

Key feature Analysis

Note that the most used keys in our sample are: 0 = C, 2 = D, 7 = G and 9 = A.
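The integers follow standard pitch class notation, so they can be mapped to note names before counting. This is a small sketch on made-up values; the `key_name` column is a hypothetical helper.

```python
import pandas as pd

# Standard pitch class notation: 0 = C, 1 = C#, ..., 11 = B.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
KEY_NAMES = dict(enumerate(PITCH_CLASSES))

# Made-up key values, only to show the mapping.
tracks = pd.DataFrame({"key": [0, 7, 2, 9, 0, 7, 0]})

# Values outside 0..11 (for example -1, no key detected) become NaN.
tracks["key_name"] = tracks["key"].map(KEY_NAMES)

# Count how often each key appears, most frequent first.
key_counts = tracks["key_name"].value_counts()
print(key_counts)
```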

Loudness feature Analysis

This distribution is quite left-skewed. In the boxplot we can also see many outliers.

Mode feature Analysis

65% of the songs in our sample are in a major key, while 35% are in a minor key.
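Shares like these can be computed with a normalized value count. In Spotify's encoding, mode 1 is major and mode 0 is minor; the rows below are made up, so the resulting shares differ from the ones above.

```python
import pandas as pd

# Made-up mode values: 1 = major, 0 = minor.
tracks = pd.DataFrame({"mode": [1, 1, 1, 0]})

# Label the modes and compute the share of each one.
mode_share = tracks["mode"].map({1: "major", 0: "minor"}).value_counts(normalize=True)
print(mode_share)
```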

Speechiness feature Analysis

Wow! Our dataset is mostly music rather than speech. Notice that only a few tracks have a high speechiness value. Let's look at them.

There are 22598 tracks that are probably made entirely of spoken words.
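Spotify's documentation describes speechiness values above 0.66 as tracks that are probably made entirely of spoken words. Assuming that is the cutoff used here (the notebook text does not state it), the filter could be sketched as follows on made-up rows:

```python
import pandas as pd

# Made-up rows covering high, medium and low speechiness.
tracks = pd.DataFrame({
    "name": ["Audiobook chapter", "Rap song", "Instrumental"],
    "speechiness": [0.95, 0.40, 0.04],
})

# Assumed cutoff, taken from Spotify's description of the feature.
SPOKEN_WORD_THRESHOLD = 0.66

# Keep the tracks that are probably spoken word only.
spoken = tracks[tracks["speechiness"] > SPOKEN_WORD_THRESHOLD]
print(len(spoken))
```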

Acousticness feature Analysis

The distribution looks quite uniform between 0.09 and 0.9, while we have heavy tails at the extreme values!

There are many songs that are not acoustic!

Instrumentalness feature Analysis

This is important! Most of the distribution of this variable is concentrated in values between 0 and 0.009. We have to dive deeper to understand this phenomenon.

Liveness feature Analysis

Looks like most of our tracks were not performed live.

There are 14312 tracks that have a liveness value > 0.8, which provides a strong likelihood that the track is live.

Valence feature Analysis

There is a peak in the valence feature. This is quite strange, as the rest of the variable is fairly uniform across the possible values.

Tempo feature Analysis

There are some spikes in the distribution, and many values that may be outliers!

Time Signature feature Analysis

We have:

Explore Linear relationship between variables

Some important points here:

Let's obtain all the pairs of columns that have a weak, moderate or strong correlation, either negative or positive.
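One way to collect these pairs is to compute the Pearson correlation matrix and keep every pair whose absolute correlation reaches a cutoff. In the sketch below the helper name `correlated_pairs`, the 0.3 cutoff for "weak" and the sample values are all assumptions, not the notebook's actual settings.

```python
from itertools import combinations

import pandas as pd


def correlated_pairs(df, threshold=0.3):
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr(numeric_only=True)
    # combinations() yields each unordered pair once and skips the diagonal.
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "r"])
    kept = pairs[pairs["r"].abs() >= threshold]
    # Sort by strength, regardless of sign.
    return kept.sort_values("r", key=lambda s: s.abs(), ascending=False).reset_index(drop=True)


# Made-up feature values, only to exercise the helper.
features = pd.DataFrame({
    "energy": [0.1, 0.3, 0.5, 0.7, 0.9],
    "loudness": [-20, -15, -11, -7, -3],
    "acousticness": [0.9, 0.8, 0.5, 0.3, 0.1],
    "valence": [0.9, 0.1, 0.5, 0.1, 0.9],
})

result = correlated_pairs(features, threshold=0.3)
print(result)
```

Working with the absolute value keeps negative relationships (such as energy against acousticness in this toy data) alongside the positive ones.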